Genezzo Cluster Hardware

Introduction

I am currently extending the Genezzo database to support shared data clusters. My code is on CPAN at Genezzo::Contrib::Clustered. I have assembled my own simple cluster to support this development.

A shared data cluster consists of multiple independent servers sharing a set of common disk drives over a network. In a production deployment the disk sharing is done over a network separate from the one used for general communication between the servers and clients. Until recently this separate Storage Area Network (SAN) had to be assembled from expensive Fibre Channel adapters and switches; it has now become possible to assemble a SAN using standard ethernet adapters and switches.

A shared data cluster with a topology like that shown below provides numerous benefits when deploying an application like the Genezzo database. These benefits fall in the areas of scalability, high availability, and, most recently, affordability. Several of them arise because every server has equal access to every shared disk. The cluster scales in compute power by adding servers, and in capacity by adding disks. The data stored by the cluster is highly available because a single server failure doesn't prevent access to any of the data, and because the SAN disks can easily be set up in a RAID configuration to guard against disk failures. Finally, clusters built with commodity AMD/Intel processors, ATA/SATA hard drives, and ethernet adapters and switches are very affordable. The major drawback of shared data clusters is the greater complexity of the operating system(s) and applications required to use them. The holy grail at both the operating system and database level is the "single system image" – the illusion, presented to higher-level applications and users, that the cluster is a single monolithic system whose underlying hardware complexity can safely be ignored.

AoE and Coraid EtherDrive Storage Blade

I have constructed my cluster using a Coraid EtherDrive Evaluation Kit. EtherDrive uses the ATA-over-Ethernet (AoE) protocol to connect ATA hard drives to servers using ordinary ethernet adapters and switches. Normally 10 blades holding hard drives are placed in a chassis. Coraid also sells an evaluation kit consisting of a single blade plus a small interface adapter card that provides power and ethernet connectivity. You provide the ATA hard drive (and your own enclosure, if desired). See pictures above.

AoE protocol support has been added to the Linux kernel as of release 2.6.11. Coraid provides open-source drivers for earlier 2.6 kernels, for 2.4 kernels (the 2.4 drivers support only a single partition per disk drive), and for FreeBSD. The aoetools command-line utilities and the vblade virtual AoE disk emulator are available on SourceForge.
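
For experimenting without a hardware blade, vblade can export a local file or disk as an AoE target. A minimal sketch, assuming a scratch backing file and the eth0 interface (both names are examples, not part of my setup):

# create a 1 GB backing file and export it as AoE shelf 0, slot 1 on eth0
dd if=/dev/zero of=/tmp/aoe-disk bs=1M count=1024
vblade 0 1 eth0 /tmp/aoe-disk

Servers with the aoe driver loaded then see this target as /dev/etherd/e0.1.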

My current cluster configuration consists of two AMD Athlon servers (running Debian Linux with 2.6.11 kernels), one Intel Celeron server (running Debian Linux with a 2.6.12 kernel), and a single shared 200 GB ATA hard drive on the EtherDrive storage blade. The servers and the blade are all connected through the same 100 Mbit ethernet switch; I don't currently have a separate dedicated ethernet SAN connection for the blade. In a production environment each server would have a second ethernet adapter connected to a separate ethernet switch for connectivity to the blade(s). This would provide dedicated bandwidth. It would also provide security, since AoE is an OSI layer 2 non-routed protocol.
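
If such a dedicated SAN adapter were present, the aoe driver could be restricted to it. A hedged sketch, assuming the SAN-facing adapter is named eth1 (an example; my current single-switch setup doesn't need this):

# load the aoe driver restricted to the SAN-facing adapter
modprobe aoe aoe_iflist=eth1
# or, with the driver already loaded, use the aoetools helper
aoe-interfaces eth1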

No Cluster File System

When a disk is directly shared between multiple servers, a cluster file system is normally needed to arbitrate access and prevent file system corruption. Cluster file systems such as Red Hat's Global File System (GFS), OpenGFS, and Oracle Cluster File System (OCFS) are available for Linux, but none appear ready to run on Debian with a 2.6 kernel. When running on a cluster, the Genezzo database system will maintain its own buffer cache, manage its own free space and metadata storage, and provide its own distributed locking facilities (initially using OpenDLM). It can therefore run instead on Linux raw devices, where a whole disk partition acts as a single file. The disk is read and written directly, bypassing the kernel's block buffer cache. This also eliminates fsync problems in which blocks are cached in the kernel instead of being written immediately to disk. The downside is that raw devices are much more primitive to maintain than file systems. Clustered Oracle initially ran on raw devices, but Oracle later created OCFS to ease the maintenance headaches. GFS and/or OCFS may be integrated into the Linux kernel in a later release, and Genezzo could then run on those file systems.
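
To illustrate how a raw device is set up, the raw(8) utility binds one of the /dev/raw/rawN character devices to an existing block device; the binding does not survive a reboot, so it must be re-established at startup. A sketch, assuming the first partition of the shared AoE disk shows up as /dev/etherd/e0.0p1 (the device name is an example):

# bind raw device 1 to the first partition of the shared AoE disk
raw /dev/raw/raw1 /dev/etherd/e0.0p1
# list the current raw device bindings
raw -qa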

I have modified a pre-release version of Genezzo to use a raw device on the shared EtherDrive blade. Genezzo running on both servers is able to read and write this shared raw device.
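
Independent of Genezzo, a quick sanity check is to write a block on one server and read it back on the other. A sketch using dd on the bound raw device (the sector offset and device names are examples; raw I/O must be done in whole 512-byte sectors):

# on the first server: write one 512-byte block of test data at sector 100
echo "hello from server one" | dd of=/dev/raw/raw1 bs=512 seek=100 count=1 conv=sync
# on the second server: read the same sector back
dd if=/dev/raw/raw1 bs=512 skip=100 count=1 | strings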

EtherDrive Instructions

The following is an outline of the necessary steps to configure a Linux 2.6 server to access an EtherDrive blade. Most of the commands must be run as user root.
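
As a rough sketch (assuming the aoe kernel driver and the aoetools utilities are installed, and using shelf 0, slot 0 as an example blade address), the sequence looks something like this:

# load the AoE driver (built in as of 2.6.11, or from Coraid's driver package)
modprobe aoe
# probe for blades and list those that respond
aoe-discover
aoe-stat
# the blade appears as a block device named after its shelf and slot
fdisk /dev/etherd/e0.0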

OpenDLM Build and Installation Notes

These notes will be expanded later...

Obtain the code from the OpenDLM project site and follow the build directions there.

I also installed the following Debian packages: libnet, libnet-dev, heartbeat, heartbeat-dev, glib, glib-dev, uuid, uuid-dev, libpils, kernel-headers-### (or linux-headers-###).

I skipped the bootstrap command (available in CVS, not in 0.9.2 or 0.9.3).

For me, OpenDLM version 0.9.2 only compiled on Linux 2.4. I used the following configure command:

./configure --with-linux-srcdir=/usr/src/kernel-headers-2.4.27-2-386 \
  --with-heartbeat_library=/usr/lib/libhbclient.so \
  --with-heartbeat_includes=/usr/include/heartbeat

For me, OpenDLM version 0.9.3 only compiled on Linux 2.6. I used the following configure command:

./configure --with-heartbeat \
  --with-linux-srcdir=/usr/src/kernel-headers-2.6.11-1-k7

On Linux 2.6, editing /etc/modules.conf did not help. Instead:

cd /lib/modules/2.6.11-1-k7/extra
ln -s dlmdk_core.ko haDLM.ko
depmod -a
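
After creating the symlink and running depmod, the module should be loadable under the name the OpenDLM startup scripts look for. A hedged check, assuming the expected module name is haDLM (as the symlink above suggests):

modprobe haDLM
lsmod | grep -i dlm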